Distributed Stochastic Optimization under Imperfect Information


Aswin Kannan Angelia Nedic Uday V. Shanbhag 


Abstract: We consider a stochastic convex optimization problem that requires minimizing a sum of misspecified agent-specific expectation-valued convex functions over the intersection of a collection of agent-specific convex sets. This misspecification is manifested in a parametric sense and may be resolved through solving a distinct stochastic convex learning problem. Our interest lies in the development of distributed algorithms in which every agent makes decisions based on the knowledge of its objective and feasibility set while learning the decisions of other agents by communicating with its local neighbors over a time-varying connectivity graph. While a significant body of research currently exists in the context of such problems, we believe that the misspecified generalization of this problem is both important and has seen little study, if any. Accordingly, our focus lies on the simultaneous resolution of both problems through a joint set of schemes that combine three distinct steps: (i) an alignment step in which every agent updates its current belief by averaging over the beliefs of its neighbors; (ii) a projected (stochastic) gradient step in which every agent further updates this averaged estimate; and (iii) a learning step in which agents update their belief of the misspecified parameter by utilizing a stochastic gradient step. Under an assumption of mere convexity on agent objectives and strong convexity of the learning problems, we show that the sequences generated by this collection of update rules converge almost surely to the solution of the correctly specified stochastic convex optimization problem and the stochastic learning problem, respectively.

I. Introduction 

Distributed algorithms have grown enormously in relevance for addressing a broad class of problems arising in network system applications in control and optimization, signal processing, communication networks, and power systems, amongst others (cf. [1], [2], [3]). A crucial assumption in any such framework is the need for precise specification of the objective function. In practice, however, in many engineered and economic systems, agent-specific functions may be misspecified from a parametric standpoint, although agents may have access to observations that can aid in resolving this misspecification. Yet almost all of the efforts in distributed algorithms sidestep the question of misspecification in the agent-specific problems, motivating the present work.

In seminal work by Tsitsiklis [4], decentralized and distributed approaches to decision-making and optimization were investigated in settings complicated by partial coordination, delayed communication, and the presence of noise. In subsequent work [5], the behavior of general distributed gradient-based algorithms was examined. In related work on parallel computing [6], iterative approaches and their convergence rate estimates were studied for distributing computational load amongst multiple processors.

Consensus-based extensions to optimization with linear constraints were considered in [7], while convergent algorithms for problems under settings of general agent-specific convex constraints were first proposed in [8], as an extension of the distributed multi-agent model proposed in [9], and further developed in [10]. In [11] and [12], a problem with common (global) inequality and equality constraints is considered and a distributed primal-dual projection method is proposed. A more general case with agents having only partial information with respect to shared and nonsmooth constraints is studied in [13]. Recent work [14] compares and obtains rate estimates for Newton and gradient-based schemes to solve distributed quadratic minimization, a form of weighted least squares problem for networks with time-varying topology. In [15], a distributed dual-averaging algorithm is proposed combining push-sum consensus and gradient steps for constrained optimization over a static graph, while in [16], a subgradient method is developed using the push-sum algorithm on time-varying graphs. Distributed algorithms that combine consensus and gradient steps have been recently developed [17], [18] for stochastic optimization problems. In recent work [19], the authors consider a setting with an asynchronous gossip protocol, while stochastic extensions to asynchronous optimization were considered in [20], [21], [22], [23], where convergent distributed schemes were proposed and error bounds for finite termination were obtained. All of the aforementioned work assumes that the functions are either known exactly or that their noisy gradients are available.

While misspecification poses a significant challenge in the resolution of optimization problems, general purpose techniques for the resolution of misspecified optimization problems through the joint solution of the misspecified problem and a suitably defined learning problem have been less studied. Our framework extends prior work on deterministic [24] and stochastic [25], [26] gradient schemes. Here, we consider a networked regime in which agents are characterized by misspecified expectation-valued convex objectives and convex feasibility sets. The overall goal lies in minimizing the sum of the agent-specific objectives over the intersection of the agent-specific constraint sets. In contrast with traditional models, agents have access to a stochastic convex learning metric that allows for resolving the prescribed misspecification. Furthermore, agents only have access to their objectives and their feasibility sets and may observe the decisions of their local neighbors as defined through a general time-varying graph. In such a setting, we consider distributed protocols that combine three distinct steps: (i) an alignment step in which every agent updates its current belief by averaging over the beliefs of its neighbors based on a set of possibly varying weights; (ii) a projected (stochastic) gradient step in which every agent further updates this averaged estimate; and (iii) a learning step in which agents update their belief of the misspecified parameter by utilizing a stochastic gradient step. We show that the produced sequences of agent-specific decisions and agent-specific beliefs regarding the misspecified parameter converge in an almost sure sense to the optimal set of solutions and the optimal parameter, respectively, under the assumption of general time-varying graphs, and note that this extends the results in [8].

The paper is organized as follows. In Section II we define the problem of interest and provide a motivation for its study. In Section III we outline our algorithm and the relevant assumptions. Basic properties of the algorithm are investigated in Section IV and the almost sure convergence of the produced sequences is established in Section V. We conclude the paper with some brief remarks in Section VI.

II. Problem Formulation and Motivation 

We consider a networked multi-agent setting with time-varying undirected connectivity graphs, where the graph at time $t$ is denoted by $\mathcal{G}^t = \{\mathcal{N}, \mathcal{E}^t\}$, $\mathcal{N} = \{1,\ldots,m\}$ denotes the set of nodes and $\mathcal{E}^t$ is the set of edges at time $t$. Each node represents a single agent and the problem of interest is

$$\text{minimize} \ \sum_{i=1}^m \mathbb{E}[\varphi_i(x,\theta^*,\xi)] \quad \text{subject to} \ x \in \bigcap_{i=1}^m X_i, \qquad (1)$$

where $\theta^* \in \mathbb{R}^p$ represents the (misspecified) vector of parameters, $\mathbb{E}[\varphi_i(x,\theta^*,\xi)]$ denotes the local cost function of agent $i$, the expectation is taken with respect to a random variable $\xi$, defined as $\xi : \Omega \to \mathbb{R}^d$, and $(\Omega, \mathcal{F}, \mathbb{P})$ denotes the associated probability space. The function $\varphi_i : \mathbb{R}^n \times \mathbb{R}^p \times \mathbb{R}^d \to \mathbb{R}$ is assumed to be convex and continuously differentiable in $x$ for all $\theta \in \Theta$ and all $\xi \in \Omega$.

(i) Local information. Agent $i$ has access to its objective function $\mathbb{E}[\varphi_i(x,\theta^*,\xi)]$ and its set $X_i$ but is unaware of the objectives and constraint sets of the other agents. Furthermore, it may communicate at time $t$ with its local neighbors, as specified by the graph $\mathcal{G}^t$.

(ii) Objective misspecification. The agent objectives are parametrized by a vector $\theta^*$ unknown to the agents.

We assume that the true parameter $\theta^*$ is a solution to a distinct convex problem, accessible to every agent:

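$$\min_{\theta \in \Theta} \ \mathbb{E}[g(\theta,\chi)], \qquad (2)$$

where $\Theta \subseteq \mathbb{R}^p$ is a closed and convex set, $\chi : \Omega_\theta \to \mathbb{R}^r$ is a random variable with the associated probability space given by $(\Omega_\theta, \mathcal{F}_\theta, \mathbb{P}_\theta)$, while $g : \mathbb{R}^p \times \mathbb{R}^r \to \mathbb{R}$ is a strongly convex and continuously differentiable function in $\theta$ for every $\chi$. Our interest lies in the joint solution of (1)-(2):

$$\min_{x} \left\{ \sum_{i=1}^m \mathbb{E}[\varphi_i(x,\theta^*,\xi)] \ \Big| \ x \in \bigcap_{i=1}^m X_i \right\}, \qquad \theta^* \in \arg\min_{\theta} \left\{ \mathbb{E}[g(\theta,\chi)] \ \big| \ \theta \in \Theta \right\}. \qquad (3)$$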

A sequential approach: Traditionally, such problems are approached sequentially: (1) an accurate approximation of $\theta^*$ is first obtained; and (2) given $\theta^*$, standard computational schemes are then applied. However, this avenue is inadvisable when the learning problems are stochastic and accurate solutions are available only via simulation schemes that require significant effort. In fact, if the learning process is terminated prematurely, the resulting solution may differ significantly from $\theta^*$ and this error can only be captured in an expected-value sense. Thus, such approaches can only provide approximate solutions and, consequently, cannot generally provide asymptotically exact solutions. Inspired by recent work on learning and optimization in a centralized regime [25], we consider the development of schemes for distributed stochastic optimization. We build on the distributed projection-based algorithm [8], which combines local averaging with a projected gradient step for agent-based constraints. In particular, we introduce an additional layer in the form of a learning step to aid in resolving the misspecification.

Motivating applications: Consensus-based optimization problems arise in a range of settings including the dispatch of distributed energy resources (DERs) [2], signal processing [], amongst others. Such settings are often complicated by misspecification; for instance, there are a host of charging, discharging and efficiency parameters associated with storage resources that often require estimation.

III. Assumptions and Algorithm 

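We begin by presenting a distributed framework for solving the problem in (3). To set this up more concretely, for all $i \in \mathcal{N}$, we let $f_i(x,\theta) = \mathbb{E}[\varphi_i(x,\theta,\xi)]$ for all $x$ and $\theta \in \Theta$, and $h(\theta) = \mathbb{E}[g(\theta,\chi)]$ for all $\theta \in \Theta$. Then, problem (3) assumes the following form:

$$x^* \in \arg\min_{x \in \cap_{i=1}^m X_i} f(x,\theta^*), \ \text{where} \ f(x,\theta) = \sum_{i=1}^m f_i(x,\theta), \qquad \theta^* \in \arg\min_{\theta \in \Theta} h(\theta). \qquad (4)$$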

We consider a distributed algorithm where agent $i$ knows $f_i$ and the set $X_i$, while all agents have access to $h$. We further assume that the $i$th agent has access to oracles that produce random samples $\nabla_x \varphi_i(x,\theta,\xi)$ and $\nabla_\theta g(\theta,\chi)$. The information needed by agents to solve the optimization problem is acquired through local sharing of the estimates over a time-varying communication network. Specifically, at iteration $k$, the $i$th agent has estimates $x_i^k \in X_i$ and $\theta_i^k \in \Theta$ and, at the next iteration, constructs a vector $v_i^k$ as an average of the vectors obtained from its local neighbors, given by:

$$v_i^k := \sum_{j=1}^m a_i^{j,k} x_j^k \quad \text{for all } i = 1,\ldots,m \text{ and } k \ge 0, \qquad (5)$$

where the weights $a_i^{j,k}$ are nonnegative scalars satisfying $\sum_{j=1}^m a_i^{j,k} = 1$ and are related to the underlying connectivity graph $\mathcal{G}^k$ over which the agents communicate at time $k$. Then, for $i = 1,\ldots,m$, the $i$th agent updates its $x$- and $\theta$-variable as follows:

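$$x_i^{k+1} := \Pi_{X_i}\left( v_i^k - \alpha_k \left( \nabla_x f_i(v_i^k,\theta_i^k) + w_i^k \right) \right), \qquad (6)$$

$$\theta_i^{k+1} := \Pi_{\Theta}\left( \theta_i^k - \gamma_k \left( \nabla h(\theta_i^k) + \beta_i^k \right) \right), \qquad (7)$$

where $w_i^k \triangleq \nabla_x \varphi_i(v_i^k,\theta_i^k,\xi_i^k) - \nabla_x f_i(v_i^k,\theta_i^k)$ with $\nabla_x f_i(x,\theta) = \mathbb{E}[\nabla_x \varphi_i(x,\theta,\xi)]$, and $\beta_i^k \triangleq \nabla_\theta g(\theta_i^k,\chi_i^k) - \nabla h(\theta_i^k)$ with $\nabla h(\theta_i^k) = \mathbb{E}[\nabla_\theta g(\theta_i^k,\chi)]$ for all $i \in \mathcal{N}$ and all $k \ge 0$. The parameters $\alpha_k > 0$ and $\gamma_k > 0$ represent stepsizes at epoch $k$, while the initial points $x_i^0 \in X_i$ and $\theta_i^0 \in \Theta$ are randomly selected for each agent $i$. The $i$th agent has access to $\nabla_x f_i(v_i^k,\theta_i^k) + w_i^k$ and not to $\nabla_x f_i(v_i^k,\theta_i^k)$. The same is the case with the learning function. At time epoch $k$, agent $i$ proceeds to average over its neighbors' decisions by using the weights in (5) and employs this average to update its decision in (6). Furthermore, agent $i$ makes a subsequent update in its belief regarding $\theta^*$, by taking a similar (stochastic) gradient update, given by (7).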
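The three steps (5)-(7) can be sketched numerically on a toy instance. The following is only an illustrative sketch under assumptions that are not taken from the paper: quadratic costs $\varphi_i(x,\theta,\xi) = \frac{1}{2}\|x - c_i - \theta\|^2 + \xi^T x$, a learning metric $g(\theta,\chi) = \frac{1}{2}\|\theta - \chi\|^2$ with $\mathbb{E}[\chi] = \theta^*$, box-shaped sets $X_i$ and $\Theta$, a fixed ring graph with uniform weights, Gaussian noise, and the function name `run_joint_scheme`. The stepsizes $\alpha_k = (k+1)^{-0.9}$ and $\gamma_k = (k+1)^{-0.51}$ are likewise chosen for illustration.

```python
import numpy as np

def run_joint_scheme(num_iters=5000, seed=0):
    """Toy run of the alignment / projected-gradient / learning steps (5)-(7).

    Everything here (costs, sets, graph, noise level) is a hypothetical instance.
    """
    rng = np.random.default_rng(seed)
    m, n = 4, 2                                  # agents, decision dimension (p = n here)
    theta_true = np.array([0.5, -0.3])           # unknown parameter theta*
    centers = rng.normal(size=(m, n))            # c_i in the toy cost of agent i
    radii = np.array([2.0, 2.5, 3.0, 3.5])[:, None]  # X_i = [-r_i, r_i]^n

    # Fixed ring graph: each agent averages itself and its two neighbors.
    # The resulting matrix is symmetric, hence doubly stochastic.
    A = np.zeros((m, m))
    for i in range(m):
        for j in (i - 1, i, i + 1):
            A[i, j % m] = 1.0 / 3.0

    x = rng.uniform(-2.0, 2.0, size=(m, n))      # x_i^0 in X_i
    theta = rng.uniform(-1.0, 1.0, size=(m, n))  # theta_i^0 in Theta = [-1, 1]^n

    for k in range(num_iters):
        alpha = (k + 1) ** -0.9                  # computation stepsize alpha_k
        gamma = (k + 1) ** -0.51                 # learning stepsize gamma_k

        v = A @ x                                # alignment step (5)

        # Noisy gradient of the toy cost at (v_i, theta_i): v - c_i - theta_i + noise
        grad_x = v - centers - theta + 0.1 * rng.normal(size=(m, n))
        x_next = np.clip(v - alpha * grad_x, -radii, radii)   # projected step (6)

        # Noisy gradient of the toy learning metric: theta_i - chi_i
        chi = theta_true + 0.1 * rng.normal(size=(m, n))
        theta = np.clip(theta - gamma * (theta - chi), -1.0, 1.0)  # learning step (7)

        x = x_next

    # For this toy cost the minimizer over the intersection of the boxes is the
    # clipped point mean(c_i) + theta*.
    x_star = np.clip(centers.mean(axis=0) + theta_true, -2.0, 2.0)
    return x, theta, x_star, theta_true
```

In this sketch every agent's row of `x` should end up close to `x_star` and every row of `theta` close to `theta_true`; the box projections are implemented with `np.clip`, which is the Euclidean projection only because the sets are boxes.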

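The weight $a_i^{j,k}$ used by agent $i$ for the iterate of agent $j$ at time $k$ is based on the connectivity graph $\mathcal{G}^k$. Specifically, letting $\mathcal{N}_i^k$ be the set of neighbors of agent $i$:

$$\mathcal{N}_i^k = \{ j \in \mathcal{N} \mid \{i,j\} \in \mathcal{E}^k \} \cup \{i\},$$

the weights $a_i^{j,k}$ are compliant with the neighbor structure:

$$a_i^{j,k} > 0 \ \text{if } j \in \mathcal{N}_i^k \quad \text{and} \quad a_i^{j,k} = 0 \ \text{if } j \notin \mathcal{N}_i^k.$$

We assume that each graph $\mathcal{G}^k$ is connected and that the matrices are doubly stochastic, as given in the following assumption.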

Assumption 1 (Graph and weight matrices):

(a) The matrix $A(k)$ (whose $(i,j)$th entry is denoted by $a_i^{j,k}$) is doubly stochastic for every $k$, i.e., $\sum_{i=1}^m a_i^{j,k} = 1$ for every $j$ and $\sum_{j=1}^m a_i^{j,k} = 1$ for every $i$.

(b) The matrices $A(k)$ have positive diagonal entries, and all positive entries in every $A(k)$ are uniformly bounded away from zero, i.e., there exists $\eta > 0$ such that, for all $i,j$ and $k$, we have $a_i^{j,k} \ge \eta$ whenever $a_i^{j,k} > 0$.

(c) The graph $\mathcal{G}^k$ is connected for every $k \ge 0$.
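One common way to obtain weights satisfying Assumption 1(a)-(b) from an undirected graph is the Metropolis rule. This rule is not taken from the paper; it is swapped in here purely as an illustration, and the function name `metropolis_weights` and the edge-list input format are assumptions. The sketch presumes distinct edges and no self-loops.

```python
import numpy as np

def metropolis_weights(m, edges):
    """Build a symmetric, doubly stochastic weight matrix for an undirected graph.

    `edges` is a list of pairs (i, j) with i != j and nodes numbered 0..m-1.
    """
    degree = np.zeros(m, dtype=int)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1

    A = np.zeros((m, m))
    for i, j in edges:
        # Off-diagonal weight depends on the larger of the two degrees.
        w = 1.0 / (1 + max(degree[i], degree[j]))
        A[i, j] = w
        A[j, i] = w

    # The diagonal absorbs the remaining mass; it is at least 1 / (1 + degree[i]) > 0.
    for i in range(m):
        A[i, i] = 1.0 - A[i].sum()
    return A
```

With this construction every positive entry is at least $1/m$, so $\eta = 1/m$ is one admissible bound in Assumption 1(b); connectivity in Assumption 1(c) remains a property of the edge list supplied by the user.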

The instantaneous connectivity assumption on the graphs 
can be relaxed by requiring that the union of these graphs is 
connected every T units of time, for instance. The analysis of 
this case is similar to that given in this paper. We choose to 
work with connected graphs in order to keep the analysis 
somewhat simpler and to provide a sharper focus on the 
learning aspect of the problem. 

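Next, we define $\mathcal{F}_0 = \{(x_i^0,\theta_i^0), \ i \in \mathcal{N}\}$ and $\mathcal{F}_k = \{(\xi_i^t,\chi_i^t); \ i \in \mathcal{N}, \ t = 0,1,\ldots,k-1\}$ for all $k \ge 1$ and make the following assumptions on the conditional first and second moments of the stochastic errors $w_i^k$ and $\beta_i^k$. These assumptions are relatively standard in the development of stochastic gradient schemes.

Assumption 2 (Conditional first and second moments):

(a) $\mathbb{E}[w_i^k \mid \mathcal{F}_k] = 0$ and $\mathbb{E}[\beta_i^k \mid \mathcal{F}_k] = 0$ for all $k$ and $i \in \mathcal{N}$.

(b) $\mathbb{E}[\|w_i^k\|^2 \mid \mathcal{F}_k] \le \nu^2$ and $\mathbb{E}[\|\beta_i^k\|^2 \mid \mathcal{F}_k] \le \nu_\theta^2$ for all $k$ and $i \in \mathcal{N}$.

(c) $\mathbb{E}[\|\theta_i^0\|^2]$ is finite for all $i \in \mathcal{N}$.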

We now discuss the assumptions on agent objectives, the 
learning metric and the underlying set constraints. 

Assumption 3 (Feasibility sets): (a) For every $i \in \mathcal{N}$, the set $X_i \subseteq \mathbb{R}^n$ is convex and compact.

(b) The intersection set $\bigcap_{i=1}^m X_i$ is nonempty.

(c) The set $\Theta \subseteq \mathbb{R}^p$ is convex and closed.

Note that under the compactness assumption on the sets $X_i$, we have $\mathbb{E}[\|x_i^0\|^2] < \infty$ for all $i \in \mathcal{N}$. Furthermore, we have

$$\max_{x_i, y_i \in X_i} \|x_i - y_i\| \le D \qquad (8)$$

for some scalar $D > 0$ and for all $i$. Next, we consider the conditions for the agent objective functions.

Assumption 4 (Agent objectives): For every $i \in \mathcal{N}$, the function $f_i(x,\theta)$ is convex in $x$ for every $\theta \in \Theta$. Furthermore, for every $i \in \mathcal{N}$, the gradients $\nabla_x f_i(x,\theta)$ are uniformly Lipschitz continuous functions in $\theta$ for all $x \in X_i$: $\|\nabla_x f_i(x,\theta^a) - \nabla_x f_i(x,\theta^b)\| \le L_\theta \|\theta^a - \theta^b\|$ for all $\theta^a, \theta^b \in \Theta$, all $x \in X_i$, and all $i \in \mathcal{N}$.

Assumption 5 (Learning metric): The function $h$ is strongly convex over $\Theta$ with a constant $\kappa > 0$, and its gradients are Lipschitz continuous with a constant $R_\theta$, i.e., $\|\nabla h(\theta^a) - \nabla h(\theta^b)\| \le R_\theta \|\theta^a - \theta^b\|$ for all $\theta^a, \theta^b \in \Theta$.

By the strong convexity of $h$, the problem (2) has a unique solution denoted by $\theta^*$. From the convexity of the functions $f_i$ in $x$ (over $\mathbb{R}^n$) for every $\theta \in \Theta$, as given in Assumption 4, these functions are continuous. Thus, when $\bigcap_{i=1}^m X_i$ is nonempty and each $X_i$ is compact (Assumption 3), the problem $\min_{x \in \cap_{i=1}^m X_i} \sum_{i=1}^m f_i(x,\theta^*)$ has a solution.

IV. Basic properties of the algorithm 

In this section, we provide some basic relations for the algorithm (5)-(7) that are fundamental to establishing the almost sure convergence of the sequences produced by the algorithm. The proofs of all the results can be found in [27].


A. Iterate Relations 


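We start with a simple result for weighted averages of finitely many points.

Lemma 1: Let $y_1,\ldots,y_m \in \mathbb{R}^n$ and $\lambda_1,\ldots,\lambda_m \in \mathbb{R}$, with $\lambda_i \ge 0$ for all $i$ and $\sum_{i=1}^m \lambda_i = 1$. Then, for any $c \in \mathbb{R}^n$, we have

$$\left\| \sum_{i=1}^m \lambda_i y_i - c \right\|^2 = \sum_{i=1}^m \lambda_i \|y_i - c\|^2 - \frac{1}{2} \sum_{j=1}^m \sum_{\ell=1}^m \lambda_j \lambda_\ell \|y_j - y_\ell\|^2.$$

Proof: By using the fact that the $\lambda_i$ are convex weights, we write

$$\left\| \sum_{i=1}^m \lambda_i y_i - c \right\|^2 = \left\| \sum_{i=1}^m \lambda_i (y_i - c) \right\|^2 = \sum_{i=1}^m \sum_{j=1}^m \lambda_i \lambda_j (y_i - c)^T (y_j - c).$$

Noting that $2a^T b = \|a\|^2 + \|b\|^2 - \|a - b\|^2$, valid for any $a, b \in \mathbb{R}^n$, and applying it to each inner product, we obtain

$$\left\| \sum_{i=1}^m \lambda_i y_i - c \right\|^2 = \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \lambda_i \lambda_j \left( \|y_i - c\|^2 + \|y_j - c\|^2 - \|y_i - y_j\|^2 \right) = \sum_{i=1}^m \lambda_i \|y_i - c\|^2 - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \lambda_i \lambda_j \|y_i - y_j\|^2,$$

where the second equality follows by noting that $\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \lambda_i \lambda_j \left( \|y_i - c\|^2 + \|y_j - c\|^2 \right) = \sum_{\ell=1}^m \lambda_\ell \|y_\ell - c\|^2$, which can be seen by using $\sum_{i=1}^m \lambda_i = 1$. ■

We use the following lemma that provides a bound on the difference between consecutive $x$-iterates of the algorithm and an analogous relation for consecutive $\theta$-iterates.

Lemma 2: Let Assumptions 1-4 hold. Also, let $X = \bigcap_{i=1}^m X_i$ and let $h$ be strongly convex over $\Theta$. Let the iterates $x_i^k$ be generated according to (5)-(7). Then, almost surely, we have for all $x \in X$ and $k \ge 0$,

$$\sum_{i=1}^m \mathbb{E}[\|x_i^{k+1} - x\|^2 \mid \mathcal{F}_k] \le \sum_{j=1}^m \|x_j^k - x\|^2 - \eta^2 \sum_{\{s,\ell\} \in \mathcal{T}^k} \|x_s^k - x_\ell^k\|^2 + m\alpha_k^2 (2S^2 + \nu^2) + m\alpha_k^{2-\tau} L_\theta^2 D^2 + (\alpha_k^\tau + 2\alpha_k^2 L_\theta^2) \sum_{i=1}^m \|\theta_i^k - \theta^*\|^2 - 2\alpha_k \sum_{i=1}^m \left( f_i(v_i^k,\theta^*) - f_i(x,\theta^*) \right),$$

where $\mathcal{T}^k$ is a spanning tree in the graph $\mathcal{G}^k$, $\tau \in (0,2)$ is an arbitrary but fixed scalar, $\theta^* = \arg\min_{\theta \in \Theta} h(\theta)$, and $S = \max_i \max_{x \in \bar{X}} \|\nabla_x f_i(x,\theta^*)\|$, with $\bar{X}$ being the convex hull of the union $\bigcup_{i=1}^m X_i$.

Proof: First, we note that by the strong convexity of $h$, the point $\theta^* \in \Theta$ minimizing $h$ over $\Theta$ exists and is unique. Next, we use the projection property for a closed convex set $Y$, according to which we have $\|\Pi_Y[x] - y\|^2 \le \|x - y\|^2$ for all $y \in Y$ and all $x$. Therefore, for all $i$ and any $x \in X$,

$$\|x_i^{k+1} - x\|^2 = \left\| \Pi_{X_i}\left( v_i^k - \alpha_k (\nabla_x f_i(v_i^k,\theta_i^k) + w_i^k) \right) - x \right\|^2 \le \left\| v_i^k - \alpha_k (\nabla_x f_i(v_i^k,\theta_i^k) + w_i^k) - x \right\|^2 \le \|v_i^k - x\|^2 + T_1 + T_2, \qquad (9)$$

where

$$T_1 \triangleq \alpha_k^2 \|\nabla_x f_i(v_i^k,\theta_i^k) + w_i^k\|^2, \qquad (10)$$

$$T_2 \triangleq -2\alpha_k (v_i^k - x)^T (\nabla_x f_i(v_i^k,\theta_i^k) + w_i^k). \qquad (11)$$

Expanding $T_1$, we obtain that

$$T_1 = \alpha_k^2 \|\nabla_x f_i(v_i^k,\theta_i^k) + w_i^k\|^2 = \alpha_k^2 \|\nabla_x f_i(v_i^k,\theta_i^k)\|^2 + \alpha_k^2 \|w_i^k\|^2 + 2\alpha_k^2 (w_i^k)^T \nabla_x f_i(v_i^k,\theta_i^k).$$

Taking the conditional expectations on both sides of the preceding relation with respect to the past and using Assumption 2 on the stochastic gradients, we have that almost surely,

$$\mathbb{E}[T_1 \mid \mathcal{F}_k] = \alpha_k^2 \|\nabla_x f_i(v_i^k,\theta^*) + \nabla_x f_i(v_i^k,\theta_i^k) - \nabla_x f_i(v_i^k,\theta^*)\|^2 + \alpha_k^2 \underbrace{\mathbb{E}[\|w_i^k\|^2 \mid \mathcal{F}_k]}_{\le \nu^2} + 2\alpha_k^2 \underbrace{\mathbb{E}[(w_i^k)^T \nabla_x f_i(v_i^k,\theta_i^k) \mid \mathcal{F}_k]}_{= 0}$$

$$\le \alpha_k^2 \left( \|\nabla_x f_i(v_i^k,\theta^*)\| + \|\nabla_x f_i(v_i^k,\theta_i^k) - \nabla_x f_i(v_i^k,\theta^*)\| \right)^2 + \alpha_k^2 \nu^2 \le \alpha_k^2 \left( 2S^2 + 2L_\theta^2 \|\theta_i^k - \theta^*\|^2 + \nu^2 \right), \qquad (12)$$

where in the last inequality we use $(a + b)^2 \le 2a^2 + 2b^2$, valid for all $a, b \in \mathbb{R}$, and the Lipschitz property of $\nabla_x f_i(x,\theta)$ (cf. Assumption 4). Furthermore, since the sets $X_i$ are compact by Assumption 3, the convex hull $\bar{X}$ of $\bigcup_{i=1}^m X_i$ is also compact, implying by continuity of the gradients that $\max_i \max_{k \ge 0} \|\nabla_x f_i(v_i^k,\theta^*)\| \le \max_i \max_{x \in \bar{X}} \|\nabla_x f_i(x,\theta^*)\| = S$, with $S < \infty$.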

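Next, we consider the term $T_2$. By taking the conditional expectation with respect to $\mathcal{F}_k$ and using $\mathbb{E}[(v_i^k - x)^T w_i^k \mid \mathcal{F}_k] = 0$, we obtain

$$\mathbb{E}[T_2 \mid \mathcal{F}_k] = -2\alpha_k (v_i^k - x)^T \left( \nabla_x f_i(v_i^k,\theta_i^k) - \nabla_x f_i(v_i^k,\theta^*) \right) - 2\alpha_k (v_i^k - x)^T \nabla_x f_i(v_i^k,\theta^*)$$

$$\le 2\alpha_k \|v_i^k - x\| \, \|\nabla_x f_i(v_i^k,\theta_i^k) - \nabla_x f_i(v_i^k,\theta^*)\| - 2\alpha_k (v_i^k - x)^T \nabla_x f_i(v_i^k,\theta^*).$$

By using the Lipschitz property of $\nabla_x f_i(x,\theta)$ (cf. Assumption 4), the relation $2\alpha_k ab = 2\left(\sqrt{\alpha_k^{2-\tau}}\,a\right)\left(\sqrt{\alpha_k^{\tau}}\,b\right)$ valid for any $a, b \in \mathbb{R}$ and any $\tau > 0$, and the Cauchy-Schwarz inequality, we further obtain

$$\mathbb{E}[T_2 \mid \mathcal{F}_k] \le 2\alpha_k L_\theta \|v_i^k - x\| \, \|\theta_i^k - \theta^*\| - 2\alpha_k (v_i^k - x)^T \nabla_x f_i(v_i^k,\theta^*) \le \alpha_k^{2-\tau} L_\theta^2 D^2 + \alpha_k^{\tau} \|\theta_i^k - \theta^*\|^2 - 2\alpha_k \left( f_i(v_i^k,\theta^*) - f_i(x,\theta^*) \right), \qquad (13)$$

where in the last inequality we also employ the convexity of $f_i$ and boundedness of the sets $X_i$, together with the fact that $v_i^k, x \in X_i$ for all $i$ (cf. Assumption 3).

Now, we take the conditional expectation in relation (9) and we substitute estimates (12) and (13), which yields almost surely, for all $i$, all $x \in X$ and all $k$,

$$\mathbb{E}[\|x_i^{k+1} - x\|^2 \mid \mathcal{F}_k] \le \|v_i^k - x\|^2 + \alpha_k^2 (2S^2 + \nu^2) + \alpha_k^{2-\tau} L_\theta^2 D^2 + (\alpha_k^\tau + 2\alpha_k^2 L_\theta^2) \|\theta_i^k - \theta^*\|^2 - 2\alpha_k \left( f_i(v_i^k,\theta^*) - f_i(x,\theta^*) \right).$$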

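Summing the preceding relations over $i = 1,\ldots,m$, we have the following inequality almost surely, for all $x \in X$ and all $k \ge 0$,

$$\sum_{i=1}^m \mathbb{E}[\|x_i^{k+1} - x\|^2 \mid \mathcal{F}_k] \le \sum_{i=1}^m \|v_i^k - x\|^2 + m\alpha_k^2 (2S^2 + \nu^2) + m\alpha_k^{2-\tau} L_\theta^2 D^2 + (\alpha_k^\tau + 2\alpha_k^2 L_\theta^2) \sum_{i=1}^m \|\theta_i^k - \theta^*\|^2 - 2\alpha_k \sum_{i=1}^m \left( f_i(v_i^k,\theta^*) - f_i(x,\theta^*) \right). \qquad (14)$$

We now focus on the term $\|v_i^k - x\|^2$. Noting that $\sum_{j=1}^m a_i^{j,k} = 1$, by Lemma 1 it follows that for all $x \in X$ and all $k \ge 0$,

$$\|v_i^k - x\|^2 = \left\| \sum_{j=1}^m a_i^{j,k} x_j^k - x \right\|^2 = \sum_{j=1}^m a_i^{j,k} \|x_j^k - x\|^2 - \frac{1}{2} \sum_{j=1}^m \sum_{\ell=1}^m a_i^{j,k} a_i^{\ell,k} \|x_j^k - x_\ell^k\|^2.$$

By summing these relations over $i$, exchanging the order of summations, and using $\sum_{i=1}^m a_i^{j,k} = 1$ for all $j$ and $k$ (cf. Assumption 1(a)), we obtain for all $x \in X$ and all $k \ge 0$,

$$\sum_{i=1}^m \|v_i^k - x\|^2 = \sum_{j=1}^m \|x_j^k - x\|^2 - \frac{1}{2} \sum_{j=1}^m \sum_{\ell=1}^m \sum_{i=1}^m a_i^{j,k} a_i^{\ell,k} \|x_j^k - x_\ell^k\|^2.$$

By using the connectivity assumption on the graph $\mathcal{G}^k$ and the assumption on the entries in the matrix $A(k)$ (cf. Assumptions 1(b) and (c)), we can see that there exists a spanning tree $\mathcal{T}^k \subseteq \mathcal{G}^k$ such that

$$\frac{1}{2} \sum_{j=1}^m \sum_{\ell=1}^m \sum_{i=1}^m a_i^{j,k} a_i^{\ell,k} \|x_j^k - x_\ell^k\|^2 \ge \eta^2 \sum_{\{s,\ell\} \in \mathcal{T}^k} \|x_s^k - x_\ell^k\|^2.$$

Therefore, for all $x \in X$ and all $k \ge 0$,

$$\sum_{i=1}^m \|v_i^k - x\|^2 \le \sum_{j=1}^m \|x_j^k - x\|^2 - \eta^2 \sum_{\{s,\ell\} \in \mathcal{T}^k} \|x_s^k - x_\ell^k\|^2,$$

and the stated relation follows by substituting the preceding relation in equation (14). ■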

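Our next lemma provides a relation for the iterates $\theta_i^k$ related to the learning scheme of the algorithm.

Lemma 3: Let Assumptions 2 and 5 hold, and let the iterates $\theta_i^k$ be generated by the algorithm (5)-(7). Then, almost surely, we have for all $k \ge 0$,

$$\sum_{i=1}^m \mathbb{E}[\|\theta_i^{k+1} - \theta^*\|^2 \mid \mathcal{F}_k] \le (1 - 2\gamma_k \kappa + \gamma_k^2 R_\theta^2) \sum_{i=1}^m \|\theta_i^k - \theta^*\|^2 + m\gamma_k^2 \nu_\theta^2,$$

with $\theta^* = \arg\min_{\theta \in \Theta} h(\theta)$.

Proof: By using the nonexpansivity of the projection operator, the strong monotonicity and Lipschitz continuity of $\nabla h(\theta)$, and by recalling the relation $\theta^* = \Pi_\Theta[\theta^* - \gamma_k \nabla h(\theta^*)]$, we obtain the following relation:

$$\|\theta_i^{k+1} - \theta^*\|^2 \le \|\theta_i^k - \gamma_k (\nabla h(\theta_i^k) + \beta_i^k) - \theta^* + \gamma_k \nabla h(\theta^*)\|^2$$

$$= \|\theta_i^k - \theta^*\|^2 + \gamma_k^2 \|\nabla h(\theta_i^k) - \nabla h(\theta^*)\|^2 + \gamma_k^2 \|\beta_i^k\|^2 - 2\gamma_k (\nabla h(\theta_i^k) - \nabla h(\theta^*))^T (\theta_i^k - \theta^*) - 2\gamma_k (\beta_i^k)^T \left( \theta_i^k - \theta^* - \gamma_k (\nabla h(\theta_i^k) - \nabla h(\theta^*)) \right)$$

$$\le (1 - 2\gamma_k \kappa + \gamma_k^2 R_\theta^2) \|\theta_i^k - \theta^*\|^2 + \gamma_k^2 \|\beta_i^k\|^2 - 2\gamma_k (\beta_i^k)^T \left( \theta_i^k - \theta^* - \gamma_k (\nabla h(\theta_i^k) - \nabla h(\theta^*)) \right).$$

Taking the conditional expectation with respect to the past $\mathcal{F}_k$, we see that almost surely for all $i$ and $k$,

$$\mathbb{E}[\|\theta_i^{k+1} - \theta^*\|^2 \mid \mathcal{F}_k] \le (1 - 2\gamma_k \kappa + \gamma_k^2 R_\theta^2) \|\theta_i^k - \theta^*\|^2 + \gamma_k^2 \nu_\theta^2,$$

since $\mathbb{E}[(\beta_i^k)^T (\theta_i^k - \theta^* - \gamma_k (\nabla h(\theta_i^k) - \nabla h(\theta^*))) \mid \mathcal{F}_k] = 0$ and $\mathbb{E}[\|\beta_i^k\|^2 \mid \mathcal{F}_k] \le \nu_\theta^2$ (by Assumption 2). By summing the preceding relations over $i$, we obtain the stated result. ■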
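The following lemma gives a key result that combines the decrease properties for $x$- and $\theta$-iterates established in Lemmas 2 and 3.

Lemma 4: Let Assumptions 1-5 hold, and let $X = \bigcap_{i=1}^m X_i$. Let the sequences $\{x_i^k\}, \{\theta_i^k\}$, $i \in \mathcal{N}$, be generated according to (5)-(7), and define

$$V(x^k,\theta^k;x) := \sum_{i=1}^m \left( \|x_i^k - x\|^2 + \|\theta_i^k - \theta^*\|^2 \right) \quad \text{for all } x \in X.$$

Then, for all $x \in X$, all $k \ge 0$, and all $\tau \in (0,2)$, the following relation holds almost surely:

$$\mathbb{E}[V(x^{k+1},\theta^{k+1};x) \mid \mathcal{F}_k] \le V(x^k,\theta^k;x) - \eta^2 \sum_{\{\ell,s\} \in \mathcal{T}^k} \|x_s^k - x_\ell^k\|^2 + m\alpha_k^2 (2S^2 + \nu^2) + m\alpha_k^{2-\tau} L_\theta^2 D^2 + 2\alpha_k G \sum_{j=1}^m \|x_j^k - z^k\| - 2\alpha_k \left( f(z^k,\theta^*) - f(x,\theta^*) \right) - \gamma_k \left( 2\kappa - \gamma_k R_\theta^2 - \frac{\alpha_k^\tau + 2\alpha_k^2 L_\theta^2}{\gamma_k} \right) \sum_{i=1}^m \|\theta_i^k - \theta^*\|^2 + m\gamma_k^2 \nu_\theta^2,$$

where

$$z^k = \Pi_X[y^k] \quad \text{with} \quad y^k = \frac{1}{m} \sum_{i=1}^m x_i^k \quad \text{for all } k \ge 0,$$

$\mathcal{T}^k$ denotes a spanning tree in $\mathcal{G}^k$, while $S$, $\bar{X}$ and $G$ are defined as $S = \max_i \max_{x \in \bar{X}} \|\nabla_x f_i(x,\theta^*)\|$, $\bar{X} = \mathrm{conv}(\bigcup_{i=1}^m X_i)$, and $G = \max_{i \in \mathcal{N}} \max_{z \in X} \|\nabla_x f_i(z,\theta^*)\|$.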

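Proof: By Lemma 2 we have almost surely for some $\tau > 0$ and for all $x \in X$ and all $k \ge 0$,

$$\sum_{i=1}^m \mathbb{E}[\|x_i^{k+1} - x\|^2 \mid \mathcal{F}_k] \le \sum_{j=1}^m \|x_j^k - x\|^2 - \eta^2 \sum_{\{s,\ell\} \in \mathcal{T}^k} \|x_s^k - x_\ell^k\|^2 + m\alpha_k^2 (2S^2 + \nu^2) + m\alpha_k^{2-\tau} L_\theta^2 D^2 + (\alpha_k^\tau + 2\alpha_k^2 L_\theta^2) \sum_{i=1}^m \|\theta_i^k - \theta^*\|^2 - 2\alpha_k \sum_{i=1}^m \left( f_i(v_i^k,\theta^*) - f_i(x,\theta^*) \right),$$

where $\mathcal{T}^k$ is a spanning tree in the graph $\mathcal{G}^k$ and $\theta^* = \arg\min_{\theta \in \Theta} h(\theta)$, which exists and is unique in view of the strong convexity of $h$. By Lemma 3, almost surely we have for all $k \ge 0$,

$$\sum_{i=1}^m \mathbb{E}[\|\theta_i^{k+1} - \theta^*\|^2 \mid \mathcal{F}_k] \le (1 - 2\gamma_k \kappa + \gamma_k^2 R_\theta^2) \sum_{i=1}^m \|\theta_i^k - \theta^*\|^2 + m\gamma_k^2 \nu_\theta^2.$$

By combining Lemmas 2 and 3 and using the notation $V$, after regrouping some terms, we see that for all $x \in X$ and all $k \ge 0$, the following holds almost surely:

$$\mathbb{E}[V(x^{k+1},\theta^{k+1};x) \mid \mathcal{F}_k] \le V(x^k,\theta^k;x) - \eta^2 \sum_{\{s,\ell\} \in \mathcal{T}^k} \|x_s^k - x_\ell^k\|^2 + m\alpha_k^2 (2S^2 + \nu^2) + m\alpha_k^{2-\tau} L_\theta^2 D^2 + (\alpha_k^\tau + 2\alpha_k^2 L_\theta^2) \sum_{i=1}^m \|\theta_i^k - \theta^*\|^2 - 2\alpha_k \sum_{i=1}^m \left( f_i(v_i^k,\theta^*) - f_i(x,\theta^*) \right) - \gamma_k (2\kappa - \gamma_k R_\theta^2) \sum_{i=1}^m \|\theta_i^k - \theta^*\|^2 + m\gamma_k^2 \nu_\theta^2. \qquad (15)$$

Next, we work with the term involving the function values. We consider the summand $f_i(v_i^k,\theta^*) - f_i(x,\theta^*)$. Define

$$z^k = \Pi_X[y^k] \quad \text{with} \quad y^k = \frac{1}{m} \sum_{i=1}^m x_i^k \quad \text{for all } k \ge 0.$$

By adding and subtracting $f_i(z^k,\theta^*)$, and by using the convexity of $f_i(\cdot,\theta^*)$ (see Assumption 4), we can show that

$$f_i(v_i^k,\theta^*) - f_i(x,\theta^*) \ge \nabla_x f_i(z^k,\theta^*)^T (v_i^k - z^k) + f_i(z^k,\theta^*) - f_i(x,\theta^*) \ge -\|\nabla_x f_i(z^k,\theta^*)\| \, \|v_i^k - z^k\| + f_i(z^k,\theta^*) - f_i(x,\theta^*).$$

Since $X_i$ is bounded for every $i$ and $z^k \in X = \bigcap_{j=1}^m X_j$, it follows that

$$\max_{k \ge 0} \|\nabla_x f_i(z^k,\theta^*)\| \le G = \max_{i \in \mathcal{N}} \left( \max_{y \in X} \|\nabla_x f_i(y,\theta^*)\| \right).$$

Thus, $\|\nabla_x f_i(z^k,\theta^*)\| \, \|v_i^k - z^k\| \le G \|v_i^k - z^k\|$, implying

$$\sum_{i=1}^m \left( f_i(v_i^k,\theta^*) - f_i(x,\theta^*) \right) \ge -G \sum_{i=1}^m \|v_i^k - z^k\| + f(z^k,\theta^*) - f(x,\theta^*),$$

where we also use the notation $f(\cdot,\theta) = \sum_{i=1}^m f_i(\cdot,\theta)$. Recalling the definition of $v_i^k$, and by using the doubly-stochastic property of the weights and the convexity of the Euclidean norm, we can see that $\sum_{i=1}^m \|v_i^k - z^k\| \le \sum_{i=1}^m \sum_{j=1}^m a_i^{j,k} \|x_j^k - z^k\| = \sum_{j=1}^m \|x_j^k - z^k\|$. Therefore,

$$\sum_{i=1}^m \left( f_i(v_i^k,\theta^*) - f_i(x,\theta^*) \right) \ge -G \sum_{j=1}^m \|x_j^k - z^k\| + f(z^k,\theta^*) - f(x,\theta^*). \qquad (16)$$

Using (16) in inequality (15) yields the stated result. ■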

B. Averages and Constraint Sets Intersection 

Now, we focus on developing a relation that will be useful for providing a bound on the distance between the iterate averages $y^k = \frac{1}{m} \sum_{j=1}^m x_j^k$ and the intersection set $X = \bigcap_{i=1}^m X_i$. Specifically, the goal is to have a bound for $\sum_{j=1}^m \|x_j^k - z^k\|$, which will allow us to leverage Lemma 4 and prove the almost sure convergence of the method. We provide such a bound for generic points $x_1,\ldots,x_m$ taken from sets $X_1,\ldots,X_m$, respectively. For this, we strengthen Assumption 3(b) on the sets $X_i$ by requiring that the interior of $X$ is nonempty. This assumption has also been used in [8] to ensure that the iterates $x_i^k \in X_i$ have accumulation points in $X$. This assumption and its role in such set-dependent iterates were originally illuminated in [28].



Assumption 6: There exists a vector $\bar{x} \in \mathrm{int}(X)$, i.e., there exists a scalar $\delta > 0$ such that $\{z \mid \|z - \bar{x}\| \le \delta\} \subset X$.

By using Lemma 2(b) from [8] and boundedness of the sets $X_i$, we establish an upper bound for $\sum_{j=1}^m \|x_j - \Pi_X[\hat{x}]\|$ for arbitrary points $x_i \in X_i$, as given in the following lemma.

Lemma 5: Let Assumptions 3 and 6 hold. Then, for the vector $\hat{x} = \frac{1}{m} \sum_{\ell=1}^m x_\ell$ with $x_\ell \in X_\ell$ for all $\ell$, we have

$$\sum_{j=1}^m \|x_j - \Pi_X[\hat{x}]\| \le m \left( 1 + \frac{mD}{\delta} \right) \max_{j,\ell \in \mathcal{N}} \|x_j - x_\ell\|.$$

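Under the interior-point assumption, we provide a refinement of Lemma 4, which will be the key relation for establishing the convergence.

Lemma 6: Let Assumptions 1-6 hold, and let $X = \bigcap_{i=1}^m X_i$. Let the sequences $\{x_i^k\}, \{\theta_i^k\}$ be generated according to (5)-(7). Then, almost surely, we have for all $x \in X$, all $k \ge 0$, and all $\ell \in \mathcal{N}$,

$$\mathbb{E}[V(x^{k+1},\theta^{k+1};x) \mid \mathcal{F}_k] \le V(x^k,\theta^k;x) - \left( \frac{\eta^2}{m-1} - \alpha_k^\sigma \right) \max_{j,s \in \mathcal{N}} \|x_j^k - x_s^k\|^2 + m\alpha_k^2 (2S^2 + \nu^2) + m\alpha_k^{2-\tau} L_\theta^2 D^2 + \alpha_k^{2-\sigma} G^2 m^2 \left( 1 + \frac{mD}{\delta} \right)^2 - 2\alpha_k \left( f(z^k,\theta^*) - f(x,\theta^*) \right) - \gamma_k \left( 2\kappa - \gamma_k R_\theta^2 - \frac{\alpha_k^\tau + 2\alpha_k^2 L_\theta^2}{\gamma_k} \right) \sum_{i=1}^m \|\theta_i^k - \theta^*\|^2 + m\gamma_k^2 \nu_\theta^2,$$

where $\sigma > 0$, while $z^k$ and the other variables and constants are the same as in Lemma 4.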

V. Almost sure convergence 

We now prove the almost sure convergence of the sequences produced by the algorithm for suitably selected stepsizes $\alpha_k$ and $\gamma_k$. In particular, we impose the following requirements on the stepsizes.

Assumption 7 (Stepsize sequences): The steplength sequences $\{\alpha_k\}$ and $\{\gamma_k\}$ satisfy the following conditions:

$$\sum_{k=0}^\infty \gamma_k = \infty, \qquad \sum_{k=0}^\infty \gamma_k^2 < \infty, \qquad \sum_{k=0}^\infty \alpha_k = \infty,$$

and for some $\tau \in (0,2)$,

$$\sum_{k=0}^\infty \alpha_k^{2-\tau} < \infty, \qquad \lim_{k \to \infty} \frac{\alpha_k^\tau}{\gamma_k} = 0.$$

Example for the stepsizes: A set of choices satisfying the above assumptions is $\gamma_k = k^{-a_1}$ and $\alpha_k = k^{-a_2}$, where

• $1 > a_2 > a_1 > \frac{1}{2}$;

• $a_2 (2 - \tau) > 1 \implies \tau < 2 - 1/a_2$;

• $a_1 < \tau a_2 \implies \tau > a_1 / a_2$.

There is an infinite set of choices for $(a_1, a_2, \tau)$ that satisfy these conditions; a concrete example is $(a_1, a_2, \tau) = (0.51, 0.9, 0.75)$. Note that $a_2 > a_1$ implies that the steplength sequence employed in computation decays faster than the corresponding sequence of the learning updates.
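The exponent conditions in the example above are easy to check mechanically. The following sketch encodes the three bullet conditions for the polynomial stepsizes $\gamma_k = k^{-a_1}$ and $\alpha_k = k^{-a_2}$; the function names `stepsize_exponents_ok` and `stepsizes` are hypothetical, and the stepsizes are evaluated only for $k \ge 1$.

```python
def stepsize_exponents_ok(a1, a2, tau):
    """Check the three exponent conditions of the stepsize example."""
    ordered = 0.5 < a1 < a2 < 1.0            # 1 > a2 > a1 > 1/2
    summable = a2 * (2.0 - tau) > 1.0        # sum of alpha_k^(2 - tau) is finite
    ratio_vanishes = a1 < tau * a2           # alpha_k^tau / gamma_k tends to 0
    return 0.0 < tau < 2.0 and ordered and summable and ratio_vanishes

def stepsizes(k, a1, a2):
    """Return (alpha_k, gamma_k) = (k^-a2, k^-a1) for an iteration index k >= 1."""
    return k ** -a2, k ** -a1
```

For instance, `stepsize_exponents_ok(0.51, 0.9, 0.75)` accepts the concrete triple quoted in the text, whereas lowering $\tau$ to 0.5 violates $a_1 < \tau a_2$ and raising it to 0.95 violates $a_2(2-\tau) > 1$.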


To analyze the behavior of the sequences $\{\theta_i^k\}$, $i \in \mathcal{N}$, we leverage the following super-martingale convergence result from [29, Lemma 10, page 49].

Lemma 7: Let $\{v_k\}$ be a sequence of nonnegative random variables adapted to the $\sigma$-algebra $\mathcal{F}_k$ and such that almost surely

$$\mathbb{E}[v_{k+1} \mid \mathcal{F}_k] \le (1 - u_k) v_k + \psi_k \quad \text{for all } k \ge 0,$$

where $0 \le u_k \le 1$, $\psi_k \ge 0$, $\sum_{k=0}^\infty u_k = \infty$, $\sum_{k=0}^\infty \psi_k < \infty$, and $\lim_{k \to \infty} \frac{\psi_k}{u_k} = 0$. Then, almost surely $\lim_{k \to \infty} v_k = 0$.

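Next, we establish a convergence property for the $\theta$-iterates of the algorithm.

Proposition 1 (Almost sure convergence of $\{\theta_i^k\}$): Let Assumptions 2 and 5 hold. Also, let $\gamma_k$ satisfy the conditions of Assumption 7. Let the iterates $\theta_i^k$ be generated according to (5)-(7). If $\theta^* = \arg\min_{\theta \in \Theta} h(\theta)$, then $\theta_i^k \to \theta^*$ as $k \to \infty$ in an almost sure sense for $i = 1,\ldots,m$.

Proof: We provide a brief proof. By Lemma 3, almost surely for all $k \ge 0$,

$$\sum_{i=1}^m \mathbb{E}[\|\theta_i^{k+1} - \theta^*\|^2 \mid \mathcal{F}_k] \le (1 - 2\gamma_k \kappa + \gamma_k^2 R_\theta^2) \sum_{i=1}^m \|\theta_i^k - \theta^*\|^2 + m\gamma_k^2 \nu_\theta^2.$$

Using Assumption 7, we can show that for all $k \ge \bar{k}$, $\gamma_k \le \frac{\kappa}{R_\theta^2}$. Then, we have almost surely

$$\sum_{i=1}^m \mathbb{E}[\|\theta_i^{k+1} - \theta^*\|^2 \mid \mathcal{F}_k] \le (1 - \gamma_k \kappa) \sum_{i=1}^m \|\theta_i^k - \theta^*\|^2 + m\gamma_k^2 \nu_\theta^2.$$

To invoke Lemma 7, we define $v_k = \sum_{i=1}^m \|\theta_i^k - \theta^*\|^2$. Furthermore, $u_k = \gamma_k \kappa$ and $\psi_k = m\gamma_k^2 \nu_\theta^2$ for $k \ge 0$. We note that $\sum_{k \ge 0} u_k = \infty$, $\sum_{k=0}^\infty \psi_k < \infty$, and $\lim_{k \to \infty} \frac{\psi_k}{u_k} = 0$ by Assumption 7. Thus, Lemma 7 applies to a shifted sequence $\{v_k\}_{k \ge \bar{k}}$ and we conclude that $v_k \to 0$ almost surely. ■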
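To see how the recursion of Lemma 3 drives the learning error to zero, one can take total expectations and iterate the resulting deterministic upper bound on $\mathbb{E}[\sum_i \|\theta_i^k - \theta^*\|^2]$. The sketch below does exactly that; the constants ($\kappa = R_\theta = \nu_\theta = 1$, $m = 4$, initial value 10) and the stepsize $\gamma_k = k^{-0.51}$ in the usage are illustrative assumptions, and the function name `learning_error_bound` is hypothetical.

```python
def learning_error_bound(v0, kappa, r_theta, nu_theta, m, a1, num_iters):
    """Iterate v <- (1 - 2*gamma*kappa + gamma^2 * R^2) * v + m * gamma^2 * nu^2.

    This is the bound of Lemma 3 after taking total expectations, with
    gamma_k = k^(-a1) for k = 1, ..., num_iters. It tracks an upper bound on
    the expected squared learning error, not a sample path.
    """
    v = v0
    for k in range(1, num_iters + 1):
        gamma = k ** -a1
        contraction = 1.0 - 2.0 * gamma * kappa + (gamma * r_theta) ** 2
        v = contraction * v + m * (gamma * nu_theta) ** 2
    return v
```

With these illustrative constants the bound shrinks roughly in proportion to $\gamma_k$, which matches the requirement $\psi_k / u_k \to 0$ in Lemma 7: the noise term $m\gamma_k^2\nu_\theta^2$ vanishes faster than the contraction rate $\gamma_k\kappa$.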

Now, we analyze the behavior of the $x$-sequences, where we leverage the following super-martingale convergence theorem from [29, Lemma 11, page 50].

Lemma 8: Let $v_k$, $u_k$, $\psi_k$ and $\delta_k$ be nonnegative random variables adapted to a $\sigma$-algebra $\mathcal{F}_k$. If almost surely $\sum_{k=0}^\infty u_k < \infty$, $\sum_{k=0}^\infty \psi_k < \infty$, and

$$\mathbb{E}[v_{k+1} \mid \mathcal{F}_k] \le (1 + u_k) v_k - \delta_k + \psi_k \quad \text{for all } k \ge 0,$$

then almost surely $\{v_k\}$ is convergent and $\sum_{k=0}^\infty \delta_k < \infty$.

As observed in Section III, under continuity of the functions $f_i(\cdot,\theta)$ (in view of Assumption 4) and the compactness of each $X_i$, the problem $\min_{x \in \cap_{i=1}^m X_i} \sum_{i=1}^m f_i(x,\theta^*)$ has a solution. We denote the set of solutions by $X^*$. Therefore, under Assumptions 3, 4, and 5, the problem (3) has a nonempty solution set, given by $X^* \times \{\theta^*\}$.

We have the following convergence result. 

Proposition 2 (Almost sure convergence of {x}}): Let 
Assumptions [T}0] hold, and let X = CffL-^^Xi. Let the 
sequences {Xi},{9'f} be generated according to ©- 0 . 




Then, the sequences {x^} converge almost surely to the 
same solution point, i.e., there exists a random vector 
z* G X* such that almost surely 

lim = 2 * for all j G Af. 

k->-oo ^ 

Proof: In Lemma |4l we let x be an optimal solution 
for the problem min^^gn-iXi Yl'ili i-®-’ x = x* 

with X* G X*. Thus, by Lemma |4] we obtain almost surely 
for any x* G X*, all fc > 0, and all £ G Af, 

E[V{x>^+\9'^+^-x*)\Xk] 

<V{x\e>^;x*)-(^^-a^) max ||4'-x^||2 

\m — 1 / j,s€X 

+ mal{2S‘^ + v"^) + mal~^ L'Id'^ 

+ air’CV (^1 + - 2a, (/(^‘,r) - /(i*,«*)) 

- 7, (2k - f; no* _ ».||2 

f i=l 

Pm-ilvl. 


Next, since a is an arbitrary positive scalar, we let cr = t 
where r G (0, 2) is obtained from Assumption Q Further¬ 
more, let ifk be defined as follows: 

-ipk = mal{2S'^ + v^) + TO7fe^'e 

+ al-^ (mLlD^ + (^1 + . 


Using the assumptions on the stepsizes we can show that 
2 

for all k > kn, we have — al > e and 

— U’ m —1 K — 


2k - 


Ik 


> 0 . 


Therefore, almost surely for all x* G X*, all k > k^, and 
all £gN, 


{V{x^,9^-,x*)} is convergent almost surely for every x* G 
X*, we can conclude that 

{Si^i 11^? ~ 2 ;*|p} is convergent a.s. V x* G X*. (19) 
Since Jf,T=o relation (fTsT i implies that 

liminf/(^^r) = /^ (20) 

k—¥<x> 

where $f^*$ is the optimal value of the problem, i.e., $f^* =
f(x^*,\theta^*)$ for any $x^* \in X^*$. The set $X$ is bounded (since each
$X_j$ is bounded by assumption), so the sequence $\{z^k\} \subset X$
is also bounded. Let $\mathcal{K}$ denote the index set of a subsequence
along which the following holds almost surely:

$$\lim_{k\to\infty,\,k\in\mathcal{K}} f(z^k,\theta^*) = \liminf_{k\to\infty} f(z^k,\theta^*), \qquad
\lim_{k\to\infty,\,k\in\mathcal{K}} z^k = z^* \ \text{ with } z^* \in X^*. \qquad (21)$$
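The step from (18) to (20) is a standard contradiction argument. A brief sketch in the notation of this section, assuming the stepsizes $\alpha_k$ are positive and working on a sample path where (18) holds, reads:

```latex
% Since z^k \in X, optimality gives f(z^k,\theta^*) \ge f^* for all k,
% so \liminf_k f(z^k,\theta^*) \ge f^*. Suppose the inequality were strict.
\begin{align*}
&\liminf_{k\to\infty} f(z^k,\theta^*) > f^*
 \;\Longrightarrow\;
 \exists\, \delta > 0,\ K \ge 0:\quad
 f(z^k,\theta^*) - f^* \ge \delta \quad \text{for all } k \ge K, \\
&\sum_{k=K}^{\infty} \alpha_k \bigl(f(z^k,\theta^*) - f^*\bigr)
 \;\ge\; \delta \sum_{k=K}^{\infty} \alpha_k \;=\; \infty ,
\end{align*}
% which contradicts the finiteness of the sum in (18).
% Hence \liminf_k f(z^k,\theta^*) = f^*.
```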

We note that $\mathcal{K}$ is a random sequence and $z^*$ is a randomly
specified vector from $X^*$. Further, relation (17) implies
that all the sequences $\{x_j^k\}$, $j = 1,\ldots,m$, have the same
accumulation points (which exist since the sets $X_j$ are
bounded). Moreover, since $\{x_j^k\} \subset X_j$ for each $j \in \mathcal{N}$,
it follows that the accumulation points of the sequences
$\{x_j^k\}$, $j = 1,\ldots,m$, must lie in the set $X = \cap_{j=1}^m X_j$.
Without loss of generality we may assume that the limit
$\lim_{k\to\infty,\,k\in\mathcal{K}} x_j^k$ exists a.s. for each $j$, so that in view of
the preceding discussion we have almost surely

$$\lim_{k\to\infty,\,k\in\mathcal{K}} x_j^k = \tilde{x}, \quad \text{with } \tilde{x} \in X,$$

$$\lim_{k\to\infty,\,k\in\mathcal{K}} y^k = \lim_{k\to\infty,\,k\in\mathcal{K}} \frac{1}{m} \sum_{\ell=1}^m x_\ell^k = \tilde{x}.$$

Then, by the continuity of the projection operator $v \mapsto \Pi_X[v]$
and the fact $z^k = \Pi_X[y^k]$, we have almost surely


E[V{x'^+^,9^+^-x*) I Xk] < Vix^,9’^;x*) 

- e max ||xj - - 2ak {f{z'^,9*) - f{x*,9*)) +tpk- 

Recall that $z^k = \Pi_X[y^k]$ with $y^k = \frac{1}{m}\sum_{\ell=1}^m x_\ell^k$. In view of
optimality of $x^*$, we have $f(z^k,\theta^*) - f(x^*,\theta^*) \ge 0$ for all
$k$ and $x^* \in X^*$. Furthermore, the conditions on the stepsizes
in Assumption 7 imply that $\sum_{k=0}^{\infty} \psi_k < \infty$. Thus, the conditions of
Lemma 8 are satisfied for the sequence $\{V(x^k,\theta^k;x^*)\}_{k \ge k_0}$
for an arbitrary $x^* \in X^*$. By Lemma 8 it follows that
$V(x^k,\theta^k;x^*)$ is convergent almost surely for every $x^* \in X^*$,
and the following hold almost surely:

OO 

V max ||x^' - xjf < OO, (17) 

$$\sum_{k=0}^{\infty} \alpha_k \left( f(z^k,\theta^*) - f(x^*,\theta^*) \right) < \infty. \qquad (18)$$

By Proposition 1 we have that $\theta_i^k \to \theta^*$ almost surely
for all $i \in \mathcal{N}$. Since $V(x^k,\theta^k;x^*) =
\sum_{i=1}^m \left( \|x_i^k - x^*\|^2 + \|\theta_i^k - \theta^*\|^2 \right)$ and the assertion that


$$\lim_{k\to\infty,\,k\in\mathcal{K}} z^k = \lim_{k\to\infty,\,k\in\mathcal{K}} \Pi_X[y^k] = \tilde{x}.$$

The preceding relation and (21) yield $\tilde{x} = z^*$, implying that
for all $j$ almost surely

$$\lim_{k\to\infty,\,k\in\mathcal{K}} x_j^k = z^*, \quad \text{with } z^* \in X^*. \qquad (22)$$

Then, we can use $x^* = z^*$ in relation (19) to conclude that
$\sum_{i=1}^m \|x_i^k - z^*\|^2$ is convergent almost surely. This and the
subsequential convergence in (22) imply that
$\sum_{i=1}^m \|x_i^k - z^*\|^2 \to 0$ almost surely. ■

Special cases: We note two special cases of relevance which
arise as a consequence of Propositions 1 and 2.

(i) Deterministic optimization and learning: First, note
that if the functions $f_i(x,\theta)$ and $h(\theta)$ are deterministic in
that the gradients $\nabla_x f_i(x,\theta)$ and $\nabla h(\theta)$ may be evaluated at
arbitrary points $x$ and $\theta$, then the results of Propositions 1
and 2 show that $\lim_{k\to\infty} \theta_i^k = \theta^*$ and $\lim_{k\to\infty} x_i^k = x^*$ for
some $x^* \in X^*$ and for all $i = 1, \ldots, m$.

(ii) Correctly specified problems: Second, now suppose
that the parameter $\theta^*$ is known to every agent, so there
is no misspecification. This case can be treated under
algorithm (5)-(7) where the iterates $\theta_i^k$ are all fixed at $\theta^*$.
Formally this can be done by setting the initial parameters
to the correct value, i.e., $\theta_i^0 = \theta^*$ for all $i$, and by using
the fact that the function $h(\theta^*)$ is known, in which case the
algorithm reduces to: for all $i = 1,\ldots,m$ and $k \ge 0$,

(23) 

ak{^.MvtO*) + w^))- (24) 

By letting $F_i(x) = f_i(x,\theta^*)$ we see that, by Proposition 2,
the iterates of the algorithm (23)-(24) converge almost surely
to a solution of problem $\min_{x \in \cap_{i=1}^m X_i} \sum_{i=1}^m F_i(x)$. Thus, the
algorithm solves this problem in a distributed fashion, where
both the functions and the sets are distributed among the agents.
In particular, this result when reduced to a deterministic case
(i.e., noiseless gradient evaluations) extends the convergence
results established in [8] where two special cases have been
studied; namely, the case when $X_i = X$ for all $i$, and the
case when the underlying graph is a complete graph and all
weights are equal (i.e., $a_i^{j,k} = \frac{1}{m}$ for all $j$ and $k \ge 0$).
Rate of convergence: While standard stochastic gradient
methods achieve the optimal rate of convergence in that
$E[f(x^k,\theta^*)] - E[f(x^*,\theta^*)] \le O(1/k)$ in the correctly
specified regime, it remains to establish similar rates in
this instance, particularly in the context of time-varying
connectivity graphs. Such rate bounds will aid in developing
practical implementations.

VI. Concluding Remarks 

Traditionally, optimization algorithms have been developed
under the premise of exact information regarding
functions and constraints. As systems grow in complexity,
a priori knowledge of cost functions and efficiencies is
difficult to guarantee. One avenue lies in using observational
information to learn these functions while optimizing the
overall system. We consider precisely such a question in
a networked multi-agent regime where agents do not
have access to the decisions of the entire collective and
are furthermore locally constrained by their own feasibility
sets. Generally, in such regimes, distributed optimization can
be carried out by combining a local averaging step with a
projected gradient step. We overlay a learning step where
agents update their belief regarding the misspecified parameter
and examine the associated schemes in this regime.
It is shown that when agents are characterized by merely
convex, albeit misspecified, problems under general
time-varying graphs, the resulting schemes produce sequences that
converge almost surely to the set of optimal solutions and
the true parameter, respectively.
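
The three steps summarized above (local averaging, projected stochastic gradient, learning) can be illustrated with a small simulation. The following is a toy sketch only, not the authors' implementation. Everything specific in it is an assumption made for illustration: the quadratic objectives f_i(x, theta) = 0.5*||x - theta*b_i||^2, the learning objective h(theta) = 0.5*E[(theta - xi)^2] with xi drawn around theta*, the box-shaped sets X_i, the unconstrained parameter space, the fixed ring with equal weights 1/3 (used here in place of a time-varying graph), the noise levels, and the stepsizes alpha_k = 1/(k+1) and gamma_k = (k+1)^(-0.6). The stepsizes are not checked against the stepsize assumption of the paper.

```python
import numpy as np


def ring_weights(m):
    """Doubly stochastic weights on a fixed ring (hypothetical choice; needs m >= 3)."""
    A = np.zeros((m, m))
    for i in range(m):
        for j in (i - 1, i, i + 1):
            A[i, j % m] = 1.0 / 3.0
    return A


def run_scheme(m=5, n=2, iters=20000, theta_star=2.0, seed=0):
    """Toy run of averaging + projected stochastic gradient + learning."""
    rng = np.random.default_rng(seed)
    b = rng.normal(size=(m, n))          # data in f_i(x, theta) = 0.5*||x - theta*b_i||^2
    hi = 1.0 + 0.1 * np.arange(m)        # X_i = [-hi_i, hi_i]^n, so the intersection is [-1, 1]^n
    A = ring_weights(m)
    x = rng.uniform(-1.0, 1.0, size=(m, n))   # row i is agent i's decision estimate
    theta = np.zeros(m)                        # entry i is agent i's parameter belief
    for k in range(iters):
        alpha = 1.0 / (k + 1)                  # assumed optimization stepsize
        gamma = 1.0 / (k + 1) ** 0.6           # assumed learning stepsize
        # (i) alignment: average the neighbors' estimates
        v = A @ x
        # (ii) projected stochastic gradient step using the current belief theta_i
        grad = (v - theta[:, None] * b) + 0.1 * rng.normal(size=(m, n))
        x = np.clip(v - alpha * grad, -hi[:, None], hi[:, None])
        # (iii) learning: stochastic gradient step on h(theta) = 0.5*E[(theta - xi)^2]
        xi = theta_star + rng.normal(size=m)
        theta = theta - gamma * (theta - xi)
    # For this toy problem the minimizer over [-1, 1]^n of sum_i f_i(x, theta*) is
    # the projection of theta* times the mean of the b_i onto the box.
    x_star = np.clip(theta_star * b.mean(axis=0), -1.0, 1.0)
    return x, theta, x_star
```

Calling `run_scheme()` returns the final agent estimates, the final parameter beliefs, and the toy problem's solution. In this setting one would expect the rows of `x` to nearly agree with each other and with `x_star`, and the entries of `theta` to be close to `theta_star`, which is consistent with (but of course no substitute for) the almost-sure convergence statements above.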